Exploratory Analysis of Entropy Inside "The Office"
"The Office" is a popular American television series that originally aired on NBC from March 24, 2005, to May 16, 2013.
Created by Greg Daniels, the show is a mockumentary-style sitcom that is a remake of the British series of the same name created by Ricky Gervais and Stephen Merchant.
The show is set in the Scranton, Pennsylvania, branch of the Dunder Mifflin Paper Company and follows the daily lives of the employees working there.
Some of the key characters include:
- Michael Scott (played by Steve Carell) - "the boss", Regional Manager of Dunder Mifflin Paper Company, Scranton Branch
- Dwight Schrute (played by Rainn Wilson) - Assistant (to the) Regional Manager and beet enthusiast
- Jim Halpert (played by John Krasinski) - paper salesman, who likes to prank Dwight and spends too much time at the reception desk
- Pam Beesly (played by Jenna Fischer) - the receptionist and wannabe artist
The dataset used in this analysis is a complete dialogue transcript of The Office (US) series, found at https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript/data
It contains 45,000+ lines of dialogue from 9 seasons of The Office. The data is relatively clean and of good quality; however, in "The Data" section I take some measures to make it more suitable for the analysis.
The Hypotheses
In order to establish a working plan for this exploratory analysis, I set up some hypotheses.
Entropy reflects characters' personality traits
Entropy reflects the "flow" of the show
However, these are not scientific enough. Let's translate them into lower-level, operational hypotheses.
- Entropy measures of dialogue in The Office scripts correspond to the personality traits and behavioral complexity of individual characters.
- The Shannon entropy of a character’s lines is positively associated with how linguistically diverse and complex their personality appears on-screen. Characters with high dialogue entropy will appear more unpredictable or multifaceted, while characters with low dialogue entropy will appear more consistent or stereotypical.
- Characters with higher variance in entropy across episodes or seasons demonstrate more personality development (e.g., evolving speech patterns), whereas characters with stable entropy are portrayed as static or consistent.
- Entropy measures of The Office are indicative of the narrative pacing and thematic complexity of the show.
- Seasons with higher word entropy (measuring the diversity of word usage in a given unit of dialogue) reflect faster or more complex pacing, while lower word entropy reflects slower, more focused narrative flow.
- Seasons with higher combined entropy across all characters’ lines indicate a broader range of interactions, while episodes with lower combined entropy reflect more tightly focused plots or limited interactions.
In order to test these hypotheses thoroughly, I would need standardized personality tests of the characters or external measurements like season ratings, reviews, etc. However, in this analysis, I wanted to focus solely on the script itself, as I've previously conducted a mixed analysis of the show using this dataset - https://www.kaggle.com/datasets/nehaprabhavalkar/the-office-dataset
Therefore, the hypothesis testing process itself will not be entirely scientific - I'll base it on my subjective judgements, which I deem to be rather accurate after watching the entire show 5 times and being the self-proclaimed ***The Office* expert**.
Spoiler Alert
Warning!
The following report does contain spoilers of the show.
The Libraries
#importing built-in libraries
import re
from collections import Counter
from math import log2
#importing numpy and pandas for data manipulation
import numpy as np
import pandas as pd
#importing scipy.stats for bootstrapping
from scipy.stats import bootstrap
#importing plotly and cufflinks for creating visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
import plotly.io as pio
import cufflinks as cf
cf.go_offline()
# setting default template to plotly_dark for all visualizations
pio.templates.default = "plotly_dark"
# for charts to be rendered properly
init_notebook_mode()
#for saving the images (commented out for the report)
#import os
#import kaleido
The Data
Time to spill some beans.
#reading the data
office = pd.read_csv('The-Office-Lines-V4.csv', encoding='latin-1')
#dropping the repeated index columns
office = office.drop('Unnamed: 6', axis=1)
The 'Office' dataset contains all the dialogues in the show, along with the name of the speaker and some other information.
office.head()
| | season | episode | title | scene | speaker | line |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 1 | Michael | All right Jim. Your quarterlies look very good... |
| 1 | 1 | 1 | Pilot | 1 | Jim | Oh, I told you. I couldn't close it. So... |
| 2 | 1 | 1 | Pilot | 1 | Michael | So you've come to the master for guidance? Is ... |
| 3 | 1 | 1 | Pilot | 1 | Jim | Actually, you called me in here, but yeah. |
| 4 | 1 | 1 | Pilot | 1 | Michael | All right. Well, let me show you how it's done. |
season : season number
episode : episode number
title : episode title
scene : scene number
speaker : speaker in the scene
line : lines of the speaker
Before going any further, let's check if there are any missing values in the dataset.
#checking for missing values
office.isnull().sum()
season     0
episode    0
title      0
scene      0
speaker    0
line       0
dtype: int64
No NAs or NULLs - great success! Now, let's see how many speakers there are in the series.
print(office['speaker'].unique())
office['speaker'].unique().shape
['Michael' 'Jim' 'Pam' 'Dwight' 'Jan' 'Michel' 'Todd Packer' 'Phyllis' 'Stanley' 'Oscar' 'Angela' 'Kevin' 'Ryan' 'Man' 'Roy' 'Mr. Brown' 'Toby' 'Kelly' 'Meredith' 'Travel Agent' 'Man on Phone' 'Everybody' 'Lonny' 'Darryl' 'Teammates' 'Michael and Dwight' 'Warehouse worker' 'Madge' 'Worker' 'Katy' 'Guy at bar' 'Other Guy at Bar' 'Guy At Bar' 'Pam and Jim' 'Employee' "Chili's Employee" 'Warehouse Guy' 'Warehouse guy' 'Man in Video' 'Video' 'Actor' 'Redheaded Actress' "Mr. O'Malley" 'Albiny' "Pam's Mom" 'Carol' 'Bill' 'Everyone' 'Crowd' 'song' 'Song' 'Dwight and Michael' 'Sherri' 'Creed' 'Devon' 'Children' 'Kid' 'Ira' "Ryan's Voicemail" 'Christian' 'Hostess' 'Michael and Christian' 'Sadiq (IT guy)' 'Mark' 'Improv Teacher' 'Mary-Beth' 'Girl acting Pregnant' 'Actress' 'Michael and Jim' 'Kevin & Oscar' 'All' 'Liquor Store Clerk' 'JIm' 'Bob Vance' 'Phyllis, Meredith, Michael, Kevin' 'Captain Jack' 'Brenda' 'Darryl and Katy' 'Jim and Pam' 'Billy Merchant' 'Doctor' 'Lab Tech' 'Dana' "Hooter's Girls" 'Phylis' 'Gil' 'Pam and others' 'Ed' 'Packer' 'Todd' "Jim's voicemail" 'Guy' 'Group chant' 'All the Men' 'Delivery man' 'Craig' 'Josh' 'David' 'Dan' 'Overhead' 'Speaker' 'Jim and Dwight' 'Melissa' 'Sasha' 'Abby' 'Jake' 'The Kids' 'Kids' 'Miss Trudy' 'Edward R. Meow' 'Chet' 'Young Michael' 'Delivery Woman' 'Delivery Boy' 'Office Staff' 'Store Employee' 'Pam/Jim' 'Linda' 'Hank' 'I.D. 
Photographer' 'Photographer' 'Anglea' 'Female worker' "Billy's Girlfriend" 'Billy' 'Dealer' 'Bob' 'Andy' 'Karen' 'Jerome Bettis' 'Ted' 'Waiter' 'Jim, Josh, and Dwight' 'Evan' 'Alan' 'Ryan and others' 'Announcer' 'Pretzel guy' 'Cousin Mose' 'Tony' 'Server' 'Girls' "Kelly's Mom" "Kelly's Father" 'Young Man' 'Andy and Jim' 'Dwight ' 'M ichael' 'Michael ' 'Dwight:' 'Hannah' 'Martin' 'Male voice' 'Michael & Dwight' 'Andy & Michael' 'Waitress' 'Chef' 'Woman at bar' 'Cindy' 'Second Cindy' 'Other waitress' 'Andy and Michael' 'Both' 'Harvey' 'Buyer' 'Kenny' 'Julius' 'Phone' 'Staples Guy' 'MIchael' 'Lady' 'Paris' 'Marcy' 'Ben Franklin' 'Elizabeth' 'Priest' 'Uncle Al' 'Randy' 'Unknown' 'Women' 'College Student' 'Business Student #1' 'Business Student #2' 'Business Student #3' 'Woman' 'Artist' 'Rachel' 'Dan Gore' 'Bartender' 'Student 1' 'Student 2' 'Child' 'Hunter' 'Darry' 'Micheal' 'Chad Lite' 'Jamie' 'Barbara' 'School Official' 'Group' 'Receptionist' 'IT Tech Guy' 'Nurse' 'Intern' 'Robert Dunder' 'Amy' 'GPS' 'Larry Myers' 'Ex-client' 'Voice of Thomas Dean' 'sAndy' 'DunMiff/sys' 'DwightKSchrute' 'Tech Guy' 'Angels' 'Pizza guy' 'Manager' 'Voice #1 on phone' 'Voice #2 on phone' 'Micahel' 'Michae' 'Nick' 'Mose' 'Co-Worker 1' 'Stanely' 'Micael' 'Vikram' 'Co-Worker 2' 'Co-Worker 3' 'Mr. Figaro' 'Oscar and Stanley' 'Ad guy 1' 'Ad guy 2' 'David Wallace' 'Andy, Creed, Kevin, Kelly, Darryl' 'Andy, Creed, Kevin, Kelly' "Michael's Ad" 'Rolando' 'Ben' 'Lester' 'Diane Kelly' 'Diane' 'Deposition Reporter' 'Council' "Hunter's CD" 'Officer 1' 'Officer 2' 'Officer' "Wendy's phone operator" 'Margaret' 'Coffee shop worker' 'W.B. 
Jones' 'Paul Faust' 'Bill Cress' 'Paul' 'Michael/Dwight' 'Troy' 'Girl in Club' 'Tall Girl #1' 'All Girls' 'Tall Girl #2' 'Girl in 2nd club' 'Cleaning lady' 'Michael and Darryl' 'Phil Maguire' 'Phil' 'Justin' 'Angela and Dwight' 'Maguire' 'Woman on mic' 'Graphics guy' 'Holly' 'Woman over speakerphone' 'Vance Refrigeration guy' 'Holy' 'Ronnie' 'Professor' 'Friend' 'JIM9334' 'Receptionitis15' 'Michael & Holly' 'Dight' 'Kendall' 'Man on phone' 'Hank ' 'Guy in audience' 'Michael and Holly' 'Michael, Holly, and Darryl' 'Tom' 'Pete' 'Mother' 'Alex' 'Customer' 'Stewardess' 'Beth' 'Concierge' 'Marie' 'Guy at table' 'Concierge Marie' 'Client' 'Dacvid Walalce' 'David Wallcve' 'Dacvid Wallace' 'Leo' 'Vance Refrigeration Guy' 'Police Officer 1' 'Police Officer 2' 'Guy buying doll' 'Rehab Nurse' 'Everyone watching' 'Entire Prince family' 'Prince Grandfather' 'Entire office' 'Jim ' 'Prince' 'Prince Granddaughter' 'Prince Grandmother' 'Prince Son' 'Phyllis and Creed' 'Lawyer' 'CPR trainer' 'CPR Trainer' 'Rose' 'Jessica Alba' 'Lily' 'Sam' 'Warehouse Michael' 'Julia' 'A.J.' 'Phone Salesman' 'Jim, Pam, Michael and Dwight' 'Blood Drive Worker' 'Blood Girl' 'Lynn' 'Blonde' 'Eric' 'Girl' 'Charles' 'Stephanie' 'Employees' 'Isaac' 'Angela and Kelly' 'Supervisor' 'Michal' 'Nana' 'Chares' 'Old Woman' 'Erin' 'Dwight and Erin' 'Dwight and Andy' 'Michael, Pam & Ryan' 'Secretary' 'Automated phone voice' 'Mr. Schofield' 'Financial Guy' 'Ty' 'Jessica' 'Vance Refrigeration Guy 1' 'Vance Refrigeration Guy 2' 'VRG 1' 'VRG 2' 'Rolph' 'AJ' 'Man from Buffalo' 'Woman from Buffalo' 'Dwight & Andy' 'Female Intern' 'Female intern' 'Maurie' 'Megan' 'Gwenneth' 'Front Desk Clerk' 'Mr. Halpert' 'Mema' 'Mr. 
Beesly' 'Little Girl' 'Penny' 'Isabel' 'Hotel Employee' 'Hotel Manager' "Pam's mom" 'Tom Halpert' 'Pete Halpert' 'Tom and Pete' "Pam's dad" 'Grotti' 'Andy and Dwight' 'Credit card rep' 'Rep' 'Various' 'Keena Gifford' 'Helene' "David Wallace's Secretary" 'Voice on CD player' 'Limo Driver' 'Jim & Pam' 'Laurie' 'Registrar' 'Security' 'Woman in line' 'Man in line' 'Shareholder' 'Female Shareholder' 'Second Shareholder' 'Third Shareholder' 'Fourth Shareholder' "O'Keefe" 'Mikela' 'Students' 'Teacher' 'Lefevre' 'Zion' 'Deliveryman' 'Michael and Erin' 'Daryl' 'Office' 'Kelly and Erin' 'Matt' 'Computron' 'Fake Stanley' 'Gabe' 'Andy & Erin' 'Christian Slater' 'Jo Bennett' 'Jo' 'Jerry' 'Teddy Wallace' 'Mrs. Wallace' 'Teddy' 'Dwight, Jim and Michael' 'Policeman' 'Hospital employee' "(Pam's mom) Heleen" 'Kathy' 'Dale' 'Clark' ' Jim' 'Isabelle' 'D' 'Warehouse guy 1' 'Warehouse guy 2' 'Reid' 'Night cleaning crew' 'Miichael' 'Dwight: ' 'Michael: ' 'Jim: ' 'Meredith: ' 'Angela: ' 'Creed: ' 'Phyllis: ' 'Everyone: ' 'Oscar: ' 'Stanley: ' 'Matt: ' 'Warehouse Guy: ' 'Darryl: ' 'Andy: ' 'Pam: ' 'Erin: ' 'Kevin: ' 'Julie: ' 'Isabel: ' 'Hide: ' 'Ryan: ' 'Kelly: ' 'Bar Manager: ' 'Bouncer: ' 'Girl at table: ' 'Cookie Monster' 'Dwight.' 
"Hayworth's waiter" "Oscar's voice from the computer" 'Donna' 'Mihael' 'Hide' 'Old lady' 'Glen' 'Gym Instructor' 'Gym instructor' 'Dwight and Angela' 'Shane' 'Reporter' 'Realtor' 'Luke' 'Window treatment guy' 'Angel' 'Salesman' 'Usher' 'Shelby' 'Sweeney Todd' 'Son' 'Nate' 'Employees except Dwight' 'Astrid' 'Carroll' 'Carrol' 'Danny' 'Steve' 'Darryl and Andy' 'Church congregation' 'Pastor' ' Pastor' 'Female church member' 'Male church member' 'Doug' 'Mee-Maw' 'MeeMaw' 'Carla' "Jim's Dad" 'Bus driver' 'Michael and Andy' 'Another guy' 'Radio' 'TV' 'Meridith' 'Robotic Voice' 'Ryan and Michael' 'Phyliss' 'Dwight & Nate' 'Passer-by' 'Pam ' 'Bass Player' 'Justine' 'Jada' 'Robert' 'Darrly' 'Member' 'Video Michael' 'Bookstore employee' 'DJ' 'David Brent' 'Older guy' 'Phyllis, Stanley, Dwight' 'Younger Guy' 'Older Woman' 'Professor Powell' 'Ryan and Kelly' 'Helen' 'Attendant' 'Hot Dog Guy' 'Cell Phone Sales Person' 'Boom Box' 'Andy and Erin' 'Delivery' 'Samuel' 'President' 'Goldenface' 'Cherokee Jack' 'Michael and Samuel together' "Holly's Mom" "Holly's Dad" 'Deangelo' 'Deangelo/Michael' 'Denagelo' "Darryl's sister" 'DeAngelo' '"Jo"' '"Angela"' '"Jim"' '"Phyllis"' 'Together' 'Audience' 'Erin and Kelly' 'abe' 'Rory' 'DeAgnelo' 'Jordan' 'All but Oscar' ' Jo' 'Darryl and Angela' 'Fred Henry' 'Fred' 'Warren Buffett' 'Warren' 'Robert California' 'Merv Bronte' 'Merv' 'Nellie Bertram' 'Nellie' 'Finger Lakes Guy' 'Pam as "fourth-biggest client"' 'Pam as "ninth-biggest client"' 'Tattoo Artist' 'Female Applicant' 'Male Applicant 1' 'Male Applicant 2' 'Gideon' 'Bruce' 'Dwight, Erin, Jim & Kevin' 'Walter' 'Ellen' 'Walter Jr' 'Andy & Walter' 'Walter & Walter Jr' "Erin's Cell Phone" 'Bert' 'Gabe/Kelly/Toby' 'Andy/Pam' 'Andy/Stanley' 'Val' 'Warehouse Crew' 'Cathy' 'Offscreen' 'Curtis' 'Drummer' 'Pam and Kelly' 'Old Man' 'Andy and Darryl' 'Darryl and Kevin' 'Park Ranger' 'Chelsea' "Chelsea's Mom" 'Archivist' 'Narrator' 'Soldier' 'Amanda' 'Susan' 'Andy/Oscar' 'Host' 'Queerenstein Bears' 
"Oscar's friend" 'Stu' 'Stonewall Host' 'Senator Lipton' 'Ernesto' 'Cece' 'Saleswoman' 'Emergency Operator' 'Paramedic' 'Donna Muraski' 'Wally Amos' 'Angela/Pam' 'Brandon' 'Blogger' 'Blogger 2' 'Lady Blogger' 'Patty' 'Old Lady' 'Others' 'Elderly Woman' 'Irene' 'Alonzo' 'Glenn' 'Kevin & Meredith' 'Lauren' 'Party guests' 'Magician' 'Ravi' 'Robert & Creed' 'Wrangler' 'Senator' 'Vet' 'Harry' 'Mr. Ramish' 'Calvin' 'Off-camera' 'Rafe' 'Fake Jim' 'Voicemail' 'Nellie and Pam' 'Video Andy' 'Phyllis, Kevin & Stanley' 'HCT Member #1' 'HCT Member #2' 'Broccoli Rob' 'Businessman #1' 'Businessman #2' 'Businessman #3' 'HCT' 'HCT Member #3' 'White' 'Boat Guy' 'Walt Jr.' 'Senator Liptop' 'Business partner' 'Molly' 'Colin' 'Trevor' 'Julius Irving' 'New Instant Message' 'Suit Store Father' 'Athlead Employee' 'Dennis' 'Wade' 'Suit Store Son' 'Female Athlead Employee' '3rd Athlead Employee' '4th Athlead Employee' 'Co-worker' 'Co-worker #2' 'Mr. Romanko' 'Dance Teacher' 'Ballerinas' 'Parent in Audience' 'Parent in audience #2' 'Parent in audience #1' 'Investor' 'Lonnie' 'Fast Food Worker' 'Drive Thru Customer' 'Brian' 'Cameraman' 'Rolf' 'Gabor' 'Zeke' 'Melvina' 'Wolf' 'Sensei Ira' 'Frank' 'Party Announcer' 'Party Guest' 'Party Photographer' 'Party Waiter' 'Nail stylist 1' 'Nail stylist 2' 'Nail manager' 'Shirley' 'Athlead Coworker' 'Roger' 'Alice' "Oscar's Computer" 'Jeb' 'German Minister' 'Fannie' 'Henry' 'Esther' 'Aunt Shirley' 'Cameron' 'Promo Voice' 'Ryan Howard' 'Mr. Ruger' 'Ruger Sister 1' 'Salesmen' 'Ruger Sister 2' 'Angela & Oscar' 'Reporter #1' 'Reporter #2' 'Mrs. 
Davis' 'Carla Fern' 'Director' 'Producer' 'Bob Vance, Vance Refrigeration' 'Production Assistant' 'Sensei' 'Philip' 'Check-in guy' 'Casey' 'Mark McGrath' 'Jim & Dwight' 'Camera Crew' 'Phillip' 'People in line' 'Santigold' 'Aaron Rodgers' 'Clay Aiken' 'Camera Man' 'Malcolm' 'Casey Dean' 'Seth Mayers' 'Bill Hader' 'Dakota' 'Stripper' 'Jakey' 'Man 1' 'Woman 1' 'Woman 2' 'Man 2' 'Moderator' 'Man 3' 'Woman 3' 'Woman 4' 'Joan' 'Minister' 'Carol Stills']
(775,)
Well, there are a lot of them - 775 to be exact. However, not all of them are unique characters: for example, lines attributed to "Andy & Michael" and "Andy and Michael" are counted as two speakers despite referring to the same pair. One can therefore assume that there are more recurring spelling inconsistencies.
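As an aside, one lightweight way to merge such variants - a hypothetical sketch, not applied in this analysis - is to normalize the speaker labels before counting them:

```python
import pandas as pd

# a few of the variant labels visible in the output above
speakers = pd.Series(['Dwight', 'Dwight ', 'Dwight:', 'Andy & Michael', 'Andy and Michael'])

normalized = (speakers
              .str.strip()                     # drop stray whitespace ('Dwight ')
              .str.rstrip(':')                 # drop trailing colons ('Dwight:')
              .str.replace(' & ', ' and '))    # unify conjunctions

print(normalized.nunique())  # 5 raw labels collapse to 2 distinct speakers
```

A fuller cleanup would also need a typo map (e.g. 'Micheal', 'MIchael' to 'Michael'), which is why I simply restrict the analysis to the correctly spelled core names below.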
In this analysis, I'm mainly focusing on the core characters of the series, i.e. Michael Scott, Dwight Schrute, Jim Halpert and Pam Beesly.
The Character Entropy
Let's start by cleaning the data.
def formatLine(line):
    line = line.lower()
    line = re.sub(r'[^\w\s]', '', line)
    return line
office['line_formatted'] = office['line'].apply(formatLine)
office['line_formatted'].head()
0 all right jim your quarterlies look very good ... 1 oh i told you i couldnt close it so 2 so youve come to the master for guidance is th... 3 actually you called me in here but yeah 4 all right well let me show you how its done Name: line_formatted, dtype: object
In this step, I've created a properly formatted version of each line. Using regex, I converted the text to lowercase and removed unnecessary symbols, like commas and periods, because the further analysis concerns only letters and words.
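As a quick sanity check, the cleaning function reproduces the formatted output shown above for a single raw line (the function is repeated here so the snippet is self-contained):

```python
import re

def formatLine(line):
    # lowercase the text and strip everything except word characters and whitespace
    line = line.lower()
    line = re.sub(r'[^\w\s]', '', line)
    return line

print(formatLine("Oh, I told you. I couldn't close it. So..."))
# prints: oh i told you i couldnt close it so
```

Note that the apostrophe in "couldn't" is removed as well, since `'` is neither a word character nor whitespace.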
Next, calculating Shannon entropy requires a suitable function. In this case, I defined a simple one myself, shown below.
def entropy(text):
    counter = Counter(text)
    total = sum(counter.values())
    return -sum(count / total * log2(count / total) for count in counter.values())
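The function computes character-level Shannon entropy, H = -sum(p * log2(p)), over the frequency of each character in the line. A few hand-checkable cases make the scale concrete (the definition is repeated so the snippet is self-contained):

```python
from collections import Counter
from math import log2

def entropy(text):
    # character-level Shannon entropy, in bits
    counter = Counter(text)
    total = sum(counter.values())
    return -sum(count / total * log2(count / total) for count in counter.values())

print(entropy("aabb"))  # two equally likely symbols: exactly 1 bit
print(entropy("abcd"))  # four equally likely symbols: exactly 2 bits
```

A string of one repeated character has zero entropy, and longer lines with many distinct characters approach the 4-4.5 bit range seen in the tables below.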
After defining the function, I applied it to the entire dataframe - to both formatted and unformatted lines - in order to see if it works properly.
office['entropy'] = office['line'].apply(entropy)
office['entropy_formatted'] = office['line_formatted'].apply(entropy)
office.head()
| | season | episode | title | scene | speaker | line | line_formatted | entropy | entropy_formatted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 1 | Michael | All right Jim. Your quarterlies look very good... | all right jim your quarterlies look very good ... | 4.239712 | 3.999839 |
| 1 | 1 | 1 | Pilot | 1 | Jim | Oh, I told you. I couldn't close it. So... | oh i told you i couldnt close it so | 3.851149 | 3.364299 |
| 2 | 1 | 1 | Pilot | 1 | Michael | So you've come to the master for guidance? Is ... | so youve come to the master for guidance is th... | 4.245317 | 3.975739 |
| 3 | 1 | 1 | Pilot | 1 | Jim | Actually, you called me in here, but yeah. | actually you called me in here but yeah | 3.927418 | 3.675892 |
| 4 | 1 | 1 | Pilot | 1 | Michael | All right. Well, let me show you how it's done. | all right well let me show you how its done | 3.987594 | 3.695948 |
As expected, the formatted lines have lower entropy, as there are fewer distinct symbols to contribute to the uncertainty.
Now, it's time to plot the entropy distribution.
# defining data for the plot
x0 = office['entropy']
x1 = office['entropy_formatted']
# starting the plot
fig = go.Figure()
fig.add_trace(go.Histogram(x=x0, name = 'Original'))
fig.add_trace(go.Histogram(x=x1, name = 'Formatted'))
# overlaying the histograms
fig.update_layout(barmode='overlay', title_text='Entropy of Original and Formatted Lines')
fig.update_xaxes(title_text='Entropy')
fig.update_yaxes(title_text='Count')
# reducing opacity for better visibility and showing the plot
fig.update_traces(opacity=0.75)
fig.show()
#saving the image
#fig.write_image('graphs/entropy_formatted_vs_not_formatted.png', engine="kaleido")
Original lines exhibit a broader range of entropy, including both very low and very high values, suggesting greater linguistic variability in unprocessed dialogue. In contrast, formatted lines show more concentrated peaks, particularly at lower entropy values (around 1 and 2), likely reflecting the effects of normalization or repetitive phrases. Both distributions converge in the higher entropy range (3.5–4.5), which likely corresponds to the rich, diverse dialogue characteristic of the show's humor and character interactions. This indicates that formatting reduces variability while preserving the core structure of the more dynamic, unpredictable lines central to the narrative flow.
Creating a smaller dataframe for each major character.
michael = office[office['speaker'] == 'Michael']
dwight = office[office['speaker'] == 'Dwight']
jim = office[office['speaker'] == 'Jim']
pam = office[office['speaker'] == 'Pam']
andy = office[office['speaker'] == 'Andy']
toby = office[office['speaker'] == 'Toby']
stanley = office[office['speaker'] == 'Stanley']
kelly = office[office['speaker'] == 'Kelly']
ryan = office[office['speaker'] == 'Ryan']
phyllis = office[office['speaker'] == 'Phyllis']
oscar = office[office['speaker'] == 'Oscar']
darryl = office[office['speaker'] == 'Darryl']
jan = office[office['speaker'] == 'Jan']
creed = office[office['speaker'] == 'Creed']
meredith = office[office['speaker'] == 'Meredith']
angela = office[office['speaker'] == 'Angela']
kevin = office[office['speaker'] == 'Kevin']
erin = office[office['speaker'] == 'Erin']
#erin.head()
Calculating mean entropy for each character.
Firstly, regular script lines.
michael_entropy = michael['entropy'].mean()
dwight_entropy = dwight['entropy'].mean()
jim_entropy = jim['entropy'].mean()
pam_entropy = pam['entropy'].mean()
andy_entropy = andy['entropy'].mean()
toby_entropy = toby['entropy'].mean()
stanley_entropy = stanley['entropy'].mean()
kelly_entropy = kelly['entropy'].mean()
ryan_entropy = ryan['entropy'].mean()
phyllis_entropy = phyllis['entropy'].mean()
oscar_entropy = oscar['entropy'].mean()
darryl_entropy = darryl['entropy'].mean()
jan_entropy = jan['entropy'].mean()
creed_entropy = creed['entropy'].mean()
meredith_entropy = meredith['entropy'].mean()
angela_entropy = angela['entropy'].mean()
kevin_entropy = kevin['entropy'].mean()
erin_entropy = erin['entropy'].mean()
Now, formatted lines.
michael_entropy_formatted = michael['entropy_formatted'].mean()
dwight_entropy_formatted = dwight['entropy_formatted'].mean()
jim_entropy_formatted = jim['entropy_formatted'].mean()
pam_entropy_formatted = pam['entropy_formatted'].mean()
andy_entropy_formatted = andy['entropy_formatted'].mean()
toby_entropy_formatted = toby['entropy_formatted'].mean()
stanley_entropy_formatted = stanley['entropy_formatted'].mean()
kelly_entropy_formatted = kelly['entropy_formatted'].mean()
ryan_entropy_formatted = ryan['entropy_formatted'].mean()
phyllis_entropy_formatted = phyllis['entropy_formatted'].mean()
oscar_entropy_formatted = oscar['entropy_formatted'].mean()
darryl_entropy_formatted = darryl['entropy_formatted'].mean()
jan_entropy_formatted = jan['entropy_formatted'].mean()
creed_entropy_formatted = creed['entropy_formatted'].mean()
meredith_entropy_formatted = meredith['entropy_formatted'].mean()
angela_entropy_formatted = angela['entropy_formatted'].mean()
kevin_entropy_formatted = kevin['entropy_formatted'].mean()
erin_entropy_formatted = erin['entropy_formatted'].mean()
show_entropy = office['entropy'].mean()
show_entropy_formatted = office['entropy_formatted'].mean()
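As an aside, the repetitive per-character blocks above could be collapsed into a single pandas groupby. A minimal sketch on a toy frame - the column names match the dataset, but the values are invented:

```python
import pandas as pd

# toy frame with the dataset's column names; the entropy values are made up
office = pd.DataFrame({
    'speaker': ['Michael', 'Jim', 'Michael', 'Pam'],
    'entropy_formatted': [3.5, 3.0, 4.5, 3.25],
})

# one groupby replaces the eighteen separate per-character .mean() assignments
mean_by_speaker = office.groupby('speaker')['entropy_formatted'].mean()
print(mean_by_speaker['Michael'])  # (3.5 + 4.5) / 2 = 4.0
```

I kept the explicit per-character variables in the report for readability, but the groupby form scales to all 775 speakers at once.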
Comparing the two entropies.
fig = go.Figure(data=[
go.Bar(name='Raw Text Entropy', x=['Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'], y=[michael_entropy,dwight_entropy,jim_entropy,pam_entropy,andy_entropy,toby_entropy,stanley_entropy,kelly_entropy,ryan_entropy,phyllis_entropy,oscar_entropy,darryl_entropy,jan_entropy,creed_entropy,meredith_entropy,angela_entropy,kevin_entropy,erin_entropy]),
go.Bar(name='Formatted Text Entropy', x=['Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'], y=[michael_entropy_formatted,dwight_entropy_formatted,jim_entropy_formatted,pam_entropy_formatted,andy_entropy_formatted,toby_entropy_formatted,stanley_entropy_formatted,kelly_entropy_formatted,ryan_entropy_formatted,phyllis_entropy_formatted,oscar_entropy_formatted,darryl_entropy_formatted,jan_entropy_formatted,creed_entropy_formatted,meredith_entropy_formatted,angela_entropy_formatted,kevin_entropy_formatted,erin_entropy_formatted])
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.update_layout(title='Character Entropy - raw versus formatted text', yaxis_title='Entropy', xaxis_title='Character')
fig.show()
#saving the image
#fig.write_image('graphs/entropy_by_character_formatted_vs_unformatted.png', engine="kaleido")
Across all characters, raw text entropy (purple bars) is consistently higher than formatted text entropy (red bars), confirming that formatting reduces variability in the character-level distributions, as stated before. The differences between raw and formatted entropy are relatively uniform across characters, indicating that the formatting process has a consistent impact regardless of the individual textual distributions. Overall, this suggests that while the raw text retains more nuanced diversity, formatting simplifies the distributions while maintaining similar overall patterns for each character.
Summing up, the mean entropy is lower in every case after formatting the text. From now on, let's only focus on the entropy of formatted lines.
Calculating mean entropy for each season of the show.
entropy_season_one = office[office['season'] == 1]['entropy'].mean()
entropy_season_two = office[office['season'] == 2]['entropy'].mean()
entropy_season_three = office[office['season'] == 3]['entropy'].mean()
entropy_season_four = office[office['season'] == 4]['entropy'].mean()
entropy_season_five = office[office['season'] == 5]['entropy'].mean()
entropy_season_six = office[office['season'] == 6]['entropy'].mean()
entropy_season_seven = office[office['season'] == 7]['entropy'].mean()
entropy_season_eight = office[office['season'] == 8]['entropy'].mean()
entropy_season_nine = office[office['season'] == 9]['entropy'].mean()
#creating a dataframe for the entropy by season
seasons_entropy = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
'Entropy': [entropy_season_one, entropy_season_two, entropy_season_three, entropy_season_four, entropy_season_five, entropy_season_six, entropy_season_seven, entropy_season_eight, entropy_season_nine]})
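The per-season aggregation above could likewise be produced directly with a groupby, yielding the plotting frame in one step. A sketch on toy values (invented, not the real season means):

```python
import pandas as pd

# toy frame mirroring the dataset's columns; entropy values are invented
office = pd.DataFrame({
    'season': [1, 1, 2, 2],
    'entropy': [4.0, 3.5, 3.25, 3.75],
})

# one row per season, ready to pass to px.bar
seasons_entropy = (office.groupby('season', as_index=False)['entropy']
                   .mean()
                   .rename(columns={'season': 'Season', 'entropy': 'Entropy'}))
print(seasons_entropy['Entropy'].tolist())  # [3.75, 3.5]
```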
#plotting the data
fig = px.bar(seasons_entropy,
x='Season',
y='Entropy',
title='Character Entropy by Season',
color='Entropy',
text_auto=True,
range_color=[3.5, 3.7],)
fig.show()
#saving the image
#fig.write_image('graphs/entropy_by_season.png', engine="kaleido")
#adding bootstrap confidence intervals
seasons = office['season'].unique()
seasons.sort()
fig = go.Figure()
for season in seasons:
    data = office[office['season'] == season]['entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Entropy by Season')
fig.update_yaxes(title='Mean Entropy')
fig.show()
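For intuition, what scipy.stats.bootstrap does internally can be sketched by hand: resample the data with replacement many times, recompute the mean of each resample, and take percentiles of those means. A toy version (simulated data, not the show's):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy sample standing in for one season's per-line entropies
data = rng.normal(loc=3.6, scale=0.5, size=500)

# resample with replacement and record each resample's mean
boot_means = [rng.choice(data, size=data.size, replace=True).mean()
              for _ in range(2000)]

# a 90% percentile interval spans the 5th and 95th percentiles of the means
low, high = np.percentile(boot_means, [5, 95])
print(round(float(low), 3), round(float(high), 3))
```

The interval narrows as the sample grows, which is why characters and seasons with more lines show tighter boxes in the plots.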
Season 1 exhibits the highest mean entropy, with a relatively wide confidence interval, suggesting significant variability in dialogue complexity during the initial setup of the show. Seasons 2 through 5 show a decrease in entropy, reaching its lowest point in Season 5, with narrower intervals indicating a stabilization and a more consistent writing style during these middle seasons. Starting with Season 6, entropy begins to increase, peaking in Seasons 8 and 9, which show greater variability as reflected in their wider confidence intervals. This upward trend in the later seasons could reflect the diversification of dialogue and narrative complexity as the series progressed toward its conclusion.
Checking which characters had the highest entropy.
#creating a dataframe out of the formatted values
formatted_entropies = pd.DataFrame({'speaker': ['Show', 'Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'],
'Entropy': [show_entropy, michael_entropy,dwight_entropy,jim_entropy,pam_entropy,andy_entropy,toby_entropy,stanley_entropy,kelly_entropy,ryan_entropy,phyllis_entropy,oscar_entropy,darryl_entropy,jan_entropy,creed_entropy,meredith_entropy,angela_entropy,kevin_entropy,erin_entropy],
'entropy_formatted': [show_entropy_formatted, michael_entropy_formatted,dwight_entropy_formatted,jim_entropy_formatted,pam_entropy_formatted,andy_entropy_formatted,toby_entropy_formatted,stanley_entropy_formatted,kelly_entropy_formatted,ryan_entropy_formatted,phyllis_entropy_formatted,oscar_entropy_formatted,darryl_entropy_formatted,jan_entropy_formatted,creed_entropy_formatted,meredith_entropy_formatted,angela_entropy_formatted,kevin_entropy_formatted,erin_entropy_formatted]})
formatted_entropies
#sorting the speakers by entropy
formatted_entropies = formatted_entropies.sort_values(by='Entropy', ascending=False)
print(list(formatted_entropies['speaker']))
fig = px.bar(formatted_entropies,
x='speaker',
y='Entropy',
title='Character Entropy by Speaker',
color = 'Entropy',
text_auto = True)
fig.update_layout(yaxis_title='Entropy', xaxis_title='Speaker')
fig.add_shape(
name="show",
showlegend=False,
type="rect",
line=dict(dash="dash"),
x0=7.4,
x1=6.6,
y0=0,
y1=3.63,
)
fig.show()
['Kelly', 'Michael', 'Andy', 'Creed', 'Dwight', 'Darryl', 'Ryan', 'Show', 'Toby', 'Stanley', 'Oscar', 'Jan', 'Phyllis', 'Meredith', 'Jim', 'Pam', 'Erin', 'Angela', 'Kevin']
#saving the image
#fig.write_image('graphs/entropy_entropy_by_character_sorted.png', engine="kaleido")
#adding bootstrap confidence intervals for the chosen speakers
speakers = ['Kelly', 'Michael', 'Andy', 'Creed', 'Dwight', 'Darryl', 'Ryan', 'Toby', 'Stanley', 'Oscar', 'Jan', 'Phyllis', 'Meredith', 'Jim', 'Pam', 'Erin', 'Angela', 'Kevin']
fig = go.Figure()
for speaker in speakers:
    data = office[office['speaker'] == speaker]['entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'{speaker}'))
fig.update_layout(title='Bootstrap Confidence Interval for Entropy by Speaker')
fig.update_yaxes(title='Mean Entropy')
fig.show()
Kelly and Michael exhibit the highest mean entropy; however, Creed has the broadest interval, reflecting greater variability, likely tied to his erratic and unpredictable speech patterns. Interestingly, Michael's confidence interval is among the narrowest - possibly reflecting the fact that, despite his unpredictability, his character "made sense": he was well put together, and we did not see many deviations from his ordinary behaviour.
Andy and Dwight follow closely, showcasing diverse but consistent dialogue indicative of their dynamic and multifaceted roles.
On the other end, Kevin and Angela display the lowest mean entropy, with narrower intervals. In Angela's case, this highlights her formulaic and predictable dialogue style. In Kevin's case, it probably depicts the character's limited linguistic capabilities, as he is portrayed as a "slow and dumb" character.
Characters like Pam and Jim fall in the middle range, aligning with their stable and relatable personas, with relatively narrow confidence intervals indicating consistency.
This distribution highlights the show's intentional linguistic differentiation, where high-entropy characters like Kelly and Creed add unpredictability and variety, while lower-entropy characters like Kevin and Angela bring stability and comedic simplicity to the narrative.
Now, let's calculate how the entropy for each of the characters has changed over the seasons.
Michael
michael_entropy_season_one = michael[michael['season'] == 1]['entropy'].mean()
michael_entropy_season_two = michael[michael['season'] == 2]['entropy'].mean()
michael_entropy_season_three = michael[michael['season'] == 3]['entropy'].mean()
michael_entropy_season_four = michael[michael['season'] == 4]['entropy'].mean()
michael_entropy_season_five = michael[michael['season'] == 5]['entropy'].mean()
michael_entropy_season_six = michael[michael['season'] == 6]['entropy'].mean()
michael_entropy_season_seven = michael[michael['season'] == 7]['entropy'].mean()
#as Michael was present for (sadly) only 7 seasons, it's not necessary to consider the rest of the seasons for him
#creating a dataframe for the entropy of Michael's lines by season
michael_entropies = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7'],
'Entropy': [michael_entropy_season_one, michael_entropy_season_two, michael_entropy_season_three, michael_entropy_season_four, michael_entropy_season_five, michael_entropy_season_six, michael_entropy_season_seven]})
#plotting the entropy of Michael's lines over the seasons
fig = px.bar(michael_entropies,
x='Season',
y='Entropy',
title='Michael\'s Character Entropy by Season',
color='Entropy',
range_color=[3.4, 3.9],
text_auto = True)
fig.show()
#saving the image
#fig.write_image('graphs/entropy_michael_by_season.png', engine="kaleido")
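As an aside, the seven per-season means above could be collapsed into a single groupby call. A minimal sketch, with a toy frame standing in for the real `michael` dataframe:

```python
import pandas as pd

#toy stand-in for the `michael` dataframe (season + per-line entropy)
toy = pd.DataFrame({'season': [1, 1, 2, 2, 2],
                    'entropy': [3.5, 3.7, 3.4, 3.6, 3.8]})

#one mean per season, in one pass
michael_entropies = (toy.groupby('season', as_index=False)['entropy']
                        .mean()
                        .rename(columns={'season': 'Season', 'entropy': 'Entropy'}))
print(michael_entropies)  #both toy seasons average ~3.6
```

The resulting frame has the same `Season`/`Entropy` shape as the hand-built one, so it plugs straight into the `px.bar` call.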
#adding bootstrap confidence intervals for Michael
seasons = michael['season'].unique()
seasons = seasons[:-1] #dropping the last listed season - Michael's brief return in the finale has too few lines for a stable estimate
fig = go.Figure()
for season in seasons:
    data = michael[michael['season'] == season]['entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Michael\'s Entropy by Season')
fig.update_yaxes(title='Mean Entropy')
fig.show()
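For reference, here is the bootstrap call pattern from the cell above in isolation - a minimal, self-contained sketch with toy numbers standing in for the per-line entropy values:

```python
import numpy as np
from scipy.stats import bootstrap

#toy stand-in for one season's per-line entropy values
sample = np.array([3.5, 3.6, 3.7, 3.8, 3.4, 3.9, 3.5, 3.6])

#scipy expects a sequence of samples, hence the one-element tuple
res = bootstrap((sample,), np.mean, confidence_level=0.9, random_state=42)
print(res.confidence_interval.low, res.confidence_interval.high)
```

The resulting (low, high) pair is exactly what each Box trace in the plots above displays.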
In Season 1, Michael exhibits the highest mean entropy with a wider confidence interval, indicating greater variability and linguistic diversity in his dialogue as the character is being established. This points to a more varied, experimental linguistic style early in the series, when the show was still a close remake of the UK "The Office", and the subsequent decline suggests that his persona was indeed tamed after Season 1. From Seasons 2 to 5, entropy gradually declines, reflecting more consistent and potentially formulaic dialogue as Michael's personality and role become more defined.
Entropy stabilizes in Seasons 6 and 7, with narrower intervals suggesting a more predictable and cohesive style in his final seasons. The overall downward trend in entropy from Season 1 to Season 7 aligns with the development of Michael’s character from an unpredictable and erratic figure to a more stable and emotionally nuanced persona as the series progresses. This provides insight into how Michael’s dialogue complexity was adjusted to support his evolving narrative arc.
Dwight
dwight_entropy_season_one = dwight[dwight['season'] == 1]['entropy'].mean()
dwight_entropy_season_two = dwight[dwight['season'] == 2]['entropy'].mean()
dwight_entropy_season_three = dwight[dwight['season'] == 3]['entropy'].mean()
dwight_entropy_season_four = dwight[dwight['season'] == 4]['entropy'].mean()
dwight_entropy_season_five = dwight[dwight['season'] == 5]['entropy'].mean()
dwight_entropy_season_six = dwight[dwight['season'] == 6]['entropy'].mean()
dwight_entropy_season_seven = dwight[dwight['season'] == 7]['entropy'].mean()
dwight_entropy_season_eight = dwight[dwight['season'] == 8]['entropy'].mean()
dwight_entropy_season_nine = dwight[dwight['season'] == 9]['entropy'].mean()
#creating a dataframe for the entropy of Dwight's lines by season
dwight_entropies = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
'Entropy': [dwight_entropy_season_one, dwight_entropy_season_two, dwight_entropy_season_three, dwight_entropy_season_four, dwight_entropy_season_five, dwight_entropy_season_six, dwight_entropy_season_seven, dwight_entropy_season_eight, dwight_entropy_season_nine]})
#plotting the entropy of Dwight's lines over the seasons
fig = px.bar(dwight_entropies,
x='Season',
y='Entropy',
title='Dwight\'s Character Entropy by Season',
color='Entropy',
range_color=[3.4, 3.9],
text_auto = True)
fig.show()
#saving the image
#fig.write_image('graphs/entropy_dwight_by_season.png', engine="kaleido")
#adding bootstrap confidence intervals for Dwight
seasons = dwight['season'].unique()
fig = go.Figure()
for season in seasons:
    data = dwight[dwight['season'] == season]['entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Dwight\'s Entropy by Season')
fig.update_yaxes(title='Mean Entropy')
fig.show()
In Season 1, Dwight exhibits relatively high mean entropy with a wider confidence interval, indicating variability as his character traits are initially explored. In the middle seasons (2–5), the entropy declines slightly and stabilizes, reflecting a more consistent and defined portrayal of his eccentric (yet structured) personality.
Starting in Season 6, Dwight’s entropy begins to increase steadily, reaching its highest levels in Seasons 8 and 9. The broader confidence intervals in these later seasons suggest greater variability, potentially due to the expansion of Dwight's narrative arcs, including his leadership aspirations and more complex storylines. This trend aligns with the show's increasing focus on Dwight as a central figure in the later seasons, showcasing a richer and more diverse linguistic pattern in his dialogue.
Jim
jim_entropy_season_one = jim[jim['season'] == 1]['entropy'].mean()
jim_entropy_season_two = jim[jim['season'] == 2]['entropy'].mean()
jim_entropy_season_three = jim[jim['season'] == 3]['entropy'].mean()
jim_entropy_season_four = jim[jim['season'] == 4]['entropy'].mean()
jim_entropy_season_five = jim[jim['season'] == 5]['entropy'].mean()
jim_entropy_season_six = jim[jim['season'] == 6]['entropy'].mean()
jim_entropy_season_seven = jim[jim['season'] == 7]['entropy'].mean()
jim_entropy_season_eight = jim[jim['season'] == 8]['entropy'].mean()
jim_entropy_season_nine = jim[jim['season'] == 9]['entropy'].mean()
#creating a dataframe for the entropy of Jim's lines by season
jim_entropies = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
'Entropy': [jim_entropy_season_one, jim_entropy_season_two, jim_entropy_season_three, jim_entropy_season_four, jim_entropy_season_five, jim_entropy_season_six, jim_entropy_season_seven, jim_entropy_season_eight, jim_entropy_season_nine]})
#plotting the entropy of Jim's lines over the seasons
fig = px.bar(jim_entropies,
x='Season',
y='Entropy',
title='Jim\'s Character Entropy by Season',
color='Entropy',
range_color=[3.4, 3.9],
text_auto = True)
fig.show()
#saving the image
#fig.write_image('graphs/entropy_jim_by_season.png', engine="kaleido")
#adding bootstrap confidence intervals for Jim
seasons = jim['season'].unique()
fig = go.Figure()
for season in seasons:
    data = jim[jim['season'] == season]['entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Jim\'s Entropy by Season')
fig.update_yaxes(title='Mean Entropy')
fig.show()
In Season 1, Jim’s mean entropy is relatively high with a broad confidence interval, reflecting the initial exploration of his character’s wit and nuanced humor. As the show progresses into Seasons 2 through 5, entropy decreases and stabilizes, likely indicating a more predictable pattern in his dialogue centered around his pranks and interactions with Pam.
From Season 6 onwards, Jim's entropy gradually increases, with a noticeable rise in Season 9. This may correspond to the expansion of his narrative arc, including his career struggles and more serious moments, which add variety to his speech patterns. The broader intervals in later seasons suggest more variability in his dialogue, consistent with his evolving role in the show. This rise in later seasons reflects his genuine character development, and the low point of Season 3 mirrors how his character felt at the time: lost, confused and set aside.
Pam
pam_entropy_season_one = pam[pam['season'] == 1]['entropy'].mean()
pam_entropy_season_two = pam[pam['season'] == 2]['entropy'].mean()
pam_entropy_season_three = pam[pam['season'] == 3]['entropy'].mean()
pam_entropy_season_four = pam[pam['season'] == 4]['entropy'].mean()
pam_entropy_season_five = pam[pam['season'] == 5]['entropy'].mean()
pam_entropy_season_six = pam[pam['season'] == 6]['entropy'].mean()
pam_entropy_season_seven = pam[pam['season'] == 7]['entropy'].mean()
pam_entropy_season_eight = pam[pam['season'] == 8]['entropy'].mean()
pam_entropy_season_nine = pam[pam['season'] == 9]['entropy'].mean()
#creating a dataframe for the entropy of Pam's lines by season
pam_entropies = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
'Entropy': [pam_entropy_season_one, pam_entropy_season_two, pam_entropy_season_three, pam_entropy_season_four, pam_entropy_season_five, pam_entropy_season_six, pam_entropy_season_seven, pam_entropy_season_eight, pam_entropy_season_nine]})
#plotting the entropy of Pam's lines over the seasons
fig = px.bar(pam_entropies,
x='Season',
y='Entropy',
title='Pam\'s Character Entropy by Season',
color='Entropy',
range_color=[3.4, 3.9],
text_auto = True)
fig.show()
#saving the image
#fig.write_image('graphs/entropy_pam_by_season.png', engine="kaleido")
#bootstrap confidence intervals for Pam
seasons = pam['season'].unique()
fig = go.Figure()
for season in seasons:
    data = pam[pam['season'] == season]['entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Pam\'s Entropy by Season')
fig.update_yaxes(title='Mean Entropy')
fig.show()
In Season 1, Pam's mean entropy is relatively low, with a wide confidence interval. This reflects her initial characterization as reserved and understated, with limited variety in her dialogue. From Season 2 onward, her entropy gradually increases, peaking in Season 7 - corresponding to her growing confidence and evolving role within the office dynamics, particularly in her interactions with Jim and other colleagues.
By Season 8, Pam’s entropy appears to stabilize further, suggesting a focus on her personal milestones, such as her relationship with Jim and family developments, which may have led to a narrower range of topics in her dialogue. The confidence intervals in these later seasons are slightly wider, indicating more variability in her dialogue, consistent with the increased complexity and autonomy of her character toward the series’ conclusion.
The N-gram Entropy
In this section, instead of calculating character entropies, I will focus on n-gram entropies.
First, let's define a helper that encodes a text as a sequence of character n-gram ids, whose entropy we can then compute.
def text_to_n_gram_sequence(text, n):
    sequence = []
    n_gram_dict = {}
    next_key = 0
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if gram not in n_gram_dict:
            n_gram_dict[gram] = next_key
            next_key += 1
        sequence.append(n_gram_dict[gram])
    return sequence
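A quick sanity check of the encoding (the function is restated so the snippet runs on its own): each distinct character n-gram receives an integer id in order of first appearance, and repeats map back to the same id.

```python
def text_to_n_gram_sequence(text, n):
    sequence = []
    n_gram_dict = {}
    next_key = 0
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]  #overlapping character n-gram
        if gram not in n_gram_dict:
            n_gram_dict[gram] = next_key
            next_key += 1
        sequence.append(n_gram_dict[gram])
    return sequence

#'ab', 'ba', 'ab' -> ids 0, 1, 0
print(text_to_n_gram_sequence('abab', 2))  #[0, 1, 0]
```

The entropy of this id sequence is what the plots below show for each value of n.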
Now, let's create a corpus for each character.
michael_corpus = ' '.join(michael['line_formatted'])
dwight_corpus = ' '.join(dwight['line_formatted'])
jim_corpus = ' '.join(jim['line_formatted'])
pam_corpus = ' '.join(pam['line_formatted'])
andy_corpus = ' '.join(andy['line_formatted'])
toby_corpus = ' '.join(toby['line_formatted'])
stanley_corpus = ' '.join(stanley['line_formatted'])
kelly_corpus = ' '.join(kelly['line_formatted'])
ryan_corpus = ' '.join(ryan['line_formatted'])
phyllis_corpus = ' '.join(phyllis['line_formatted'])
oscar_corpus = ' '.join(oscar['line_formatted'])
darryl_corpus = ' '.join(darryl['line_formatted'])
jan_corpus = ' '.join(jan['line_formatted'])
creed_corpus = ' '.join(creed['line_formatted'])
meredith_corpus = ' '.join(meredith['line_formatted'])
angela_corpus = ' '.join(angela['line_formatted'])
kevin_corpus = ' '.join(kevin['line_formatted'])
erin_corpus = ' '.join(erin['line_formatted'])
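All eighteen joins above follow the same pattern, so they could also be produced in one pass with a groupby. A minimal sketch, with a toy frame standing in for the real `office` dataframe:

```python
import pandas as pd

#toy stand-in for the `office` dataframe
toy = pd.DataFrame({'speaker': ['Michael', 'Jim', 'Michael'],
                    'line_formatted': ['all right jim', 'oh i told you', 'so youve come']})

#one corpus string per speaker, keyed by name (within-group line order is preserved)
corpora = toy.groupby('speaker')['line_formatted'].apply(' '.join)
print(corpora['Michael'])  #'all right jim so youve come'
```

With the real frame, `corpora['Michael']` would match `michael_corpus` above.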
#checking the first 150 characters of Michael's corpus
print(michael_corpus[:150])
#success
all right jim your quarterlies look very good how are things at the library so youve come to the master for guidance is this what youre saying grassho
And for the whole series.
office_corpus = ' '.join(office['line_formatted'])
#checking the first 150 characters of the whole-series corpus
print(office_corpus[:150])
#success
all right jim your quarterlies look very good how are things at the library oh i told you i couldnt close it so so youve come to the master for guidan
Now, let's get the entropies for the four main characters.
entropy_michael = [entropy(text_to_n_gram_sequence(michael_corpus, i)) for i in range(1, 20)]
entropy_dwight = [entropy(text_to_n_gram_sequence(dwight_corpus, i)) for i in range(1, 20)]
entropy_jim = [entropy(text_to_n_gram_sequence(jim_corpus, i)) for i in range(1, 20)]
entropy_pam = [entropy(text_to_n_gram_sequence(pam_corpus, i)) for i in range(1, 20)]
And for the whole script.
entropy_office = [entropy(text_to_n_gram_sequence(office_corpus, i)) for i in range(1, 20)]
And for all of the major characters.
entropy_andy = [entropy(text_to_n_gram_sequence(andy_corpus, i)) for i in range(1, 20)]
entropy_toby = [entropy(text_to_n_gram_sequence(toby_corpus, i)) for i in range(1, 20)]
entropy_stanley = [entropy(text_to_n_gram_sequence(stanley_corpus, i)) for i in range(1, 20)]
entropy_kelly = [entropy(text_to_n_gram_sequence(kelly_corpus, i)) for i in range(1, 20)]
entropy_ryan = [entropy(text_to_n_gram_sequence(ryan_corpus, i)) for i in range(1, 20)]
entropy_phyllis = [entropy(text_to_n_gram_sequence(phyllis_corpus, i)) for i in range(1, 20)]
entropy_oscar = [entropy(text_to_n_gram_sequence(oscar_corpus, i)) for i in range(1, 20)]
entropy_darryl = [entropy(text_to_n_gram_sequence(darryl_corpus, i)) for i in range(1, 20)]
entropy_jan = [entropy(text_to_n_gram_sequence(jan_corpus, i)) for i in range(1, 20)]
entropy_creed = [entropy(text_to_n_gram_sequence(creed_corpus, i)) for i in range(1, 20)]
entropy_meredith = [entropy(text_to_n_gram_sequence(meredith_corpus, i)) for i in range(1, 20)]
entropy_angela = [entropy(text_to_n_gram_sequence(angela_corpus, i)) for i in range(1, 20)]
entropy_kevin = [entropy(text_to_n_gram_sequence(kevin_corpus, i)) for i in range(1, 20)]
entropy_erin = [entropy(text_to_n_gram_sequence(erin_corpus, i)) for i in range(1, 20)]
#merging all the entropies into a single dataframe (used later for bootstrap confidence intervals)
entropies = pd.DataFrame({'n': range(1, 20),
'Michael': entropy_michael,
'Dwight': entropy_dwight,
'Jim': entropy_jim,
'Pam': entropy_pam,
'Andy': entropy_andy,
'Toby': entropy_toby,
'Stanley': entropy_stanley,
'Kelly': entropy_kelly,
'Ryan': entropy_ryan,
'Phyllis': entropy_phyllis,
'Oscar': entropy_oscar,
'Darryl': entropy_darryl,
'Jan': entropy_jan,
'Creed': entropy_creed,
'Meredith': entropy_meredith,
'Angela': entropy_angela,
'Kevin': entropy_kevin,
'Erin': entropy_erin,
'Show': entropy_office})
Now, let's plot the entropies for the characters.
fig = make_subplots(rows=2, cols=2, subplot_titles=('Michael', 'Dwight', 'Jim', 'Pam'))
#michael
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_michael, mode='lines', name='Michael'), row=1, col=1)
fig.update_xaxes(type='log', row=1, col=1)
#dwight
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_dwight, mode='lines', name='Dwight'), row=1, col=2)
fig.update_xaxes(type='log', row=1, col=2)
#jim
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_jim, mode='lines', name='Jim'), row=2, col=1)
fig.update_xaxes(type='log', row=2, col=1)
#pam
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_pam, mode='lines', name='Pam'), row=2, col=2)
fig.update_xaxes(type='log', row=2, col=2)
#show the plot
fig.update_layout(height=800, width=800, title_text='Entropies of characters\' lines by n-gram')
fig.show()
#saving the image
#fig.write_image('graphs/entropy_n_grams_main_characters.png', engine="kaleido")
#the show
fig = px.scatter(x=list(range(1, 20)),
y=entropy_office,
title='Entropy of The Office by n-gram',
labels={'x': 'n-gram', 'y': 'Entropy'},
color = entropy_office,)
fig.update_xaxes(type='log', range=[np.log10(1), np.log10(25)]) #plotly log-axis ranges are specified as log10 of the bounds
fig.show()
#saving the image
#fig.write_image('graphs/entropy_n_grams_whole_series.png', engine="kaleido")
Good-looking, but difficult to compare across subplots - let's put all the lines on a single plot.
#plotting all on the same graph for easier comparison
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_michael, mode='lines', name='Michael'))
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_dwight, mode='lines', name='Dwight'))
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_jim, mode='lines', name='Jim'))
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_pam, mode='lines', name='Pam'))
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_office, mode='lines', name='Show'))
fig.update_xaxes(type='log')
fig.update_layout(title='Entropies of characters\' lines by n-gram')
fig.show()
#saving the image
#fig.write_image('graphs/entropy_n_grams_main_characters_all_same_graph.png', engine="kaleido")
The x-axis represents the size of the n-grams - here character n-grams, ranging from single characters to longer sequences - while the y-axis captures the entropy of the resulting n-gram sequence, reflecting the diversity of character combinations used by each speaker.
Michael and Dwight exhibit relatively similar patterns, with Michael's lines showcasing slightly higher entropy across all n-gram levels. This pattern suggests a broader diversity in Michael’s dialogue, aligning with his unpredictable and often erratic personality traits. Dwight’s trajectory follows closely, indicating his unique yet somewhat consistent speech style that incorporates a blend of rigid formality and occasional absurdity.
Jim and Pam display notably lower entropy throughout, emphasizing the more consistent and stable nature of their dialogue. This could reflect their comparatively grounded and relatable personas, with fewer surprising shifts in their speech patterns. Pam's line is particularly steady, indicating a measured and predictable way of speaking.
The line labeled "Show" represents the overall script entropy, which naturally encompasses the combined dialogue of all characters. It sits higher than any individual line, as expected, capturing the full spectrum of linguistic diversity across all episodes and characters.
Rest of the characters
#plotting entropies for the rest of the characters
fig = go.Figure()
#andy
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_andy, mode='lines', name='Andy'))
#toby
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_toby, mode='lines', name='Toby'))
#stanley
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_stanley, mode='lines', name='Stanley'))
#kelly
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_kelly, mode='lines', name='Kelly'))
#ryan
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_ryan, mode='lines', name='Ryan'))
#phyllis
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_phyllis, mode='lines', name='Phyllis'))
#oscar
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_oscar, mode='lines', name='Oscar'))
#darryl
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_darryl, mode='lines', name='Darryl'))
#jan
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_jan, mode='lines', name='Jan'))
#creed
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_creed, mode='lines', name='Creed'))
#meredith
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_meredith, mode='lines', name='Meredith'))
#angela
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_angela, mode='lines', name='Angela'))
#kevin
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_kevin, mode='lines', name='Kevin'))
#erin
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_erin, mode='lines', name='Erin'))
fig.update_xaxes(type='log')
fig.update_layout(title='Entropies of characters\' lines by n-gram')
fig.show()
Andy stands out with the highest overall entropy across all n-gram sizes, suggesting a broader variability in his speech. This aligns with his eclectic and often erratic personality, characterized by dynamic shifts in tone and phrasing. Kelly follows closely, reflecting her verbose and often animated dialogue style.
Characters like Toby, Stanley, and Angela exhibit lower entropy levels, highlighting their more subdued, predictable, and contextually consistent speech patterns. This fits their respective personas: Toby’s mild-mannered demeanor, Stanley’s disinterest, and Angela’s strict, no-nonsense approach to communication.
Meredith and Creed occupy middle ground, possibly owing to their sporadic but quirky and unexpected lines, which contribute to moderate linguistic variability.
The relatively flat growth curves for characters like Phyllis, Oscar, and Ryan suggest a narrower range of linguistic creativity, consistent with their generally steady and understated dialogue in the series.
And now, let's calculate mean n-gram entropy for each character.
mean_ngram_entropy_michael = np.round(np.mean(entropy_michael), 2)
mean_ngram_entropy_dwight = np.round(np.mean(entropy_dwight), 2)
mean_ngram_entropy_jim = np.round(np.mean(entropy_jim), 2)
mean_ngram_entropy_pam = np.round(np.mean(entropy_pam), 2)
mean_ngram_entropy_andy = np.round(np.mean(entropy_andy), 2)
mean_ngram_entropy_toby = np.round(np.mean(entropy_toby), 2)
mean_ngram_entropy_stanley = np.round(np.mean(entropy_stanley), 2)
mean_ngram_entropy_kelly = np.round(np.mean(entropy_kelly), 2)
mean_ngram_entropy_ryan = np.round(np.mean(entropy_ryan), 2)
mean_ngram_entropy_phyllis = np.round(np.mean(entropy_phyllis), 2)
mean_ngram_entropy_oscar = np.round(np.mean(entropy_oscar), 2)
mean_ngram_entropy_darryl = np.round(np.mean(entropy_darryl), 2)
mean_ngram_entropy_jan = np.round(np.mean(entropy_jan), 2)
mean_ngram_entropy_creed = np.round(np.mean(entropy_creed), 2)
mean_ngram_entropy_meredith = np.round(np.mean(entropy_meredith), 2)
mean_ngram_entropy_angela = np.round(np.mean(entropy_angela), 2)
mean_ngram_entropy_kevin = np.round(np.mean(entropy_kevin), 2)
mean_ngram_entropy_erin = np.round(np.mean(entropy_erin), 2)
mean_ngram_entropy_show = np.round(np.mean(entropy_office), 2)
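These nineteen means are simply the column means of the `entropies` dataframe built earlier, so they could be computed in one line. A minimal sketch with a toy frame:

```python
import pandas as pd

#toy stand-in for the `entropies` dataframe: an 'n' column plus one column per speaker
toy = pd.DataFrame({'n': [1, 2, 3],
                    'Michael': [4.0, 8.0, 12.0],
                    'Pam': [3.0, 6.0, 9.0]})

#drop the n-gram size column, then take a mean per speaker column
mean_ngram = toy.drop(columns='n').mean().round(2)
print(mean_ngram['Michael'], mean_ngram['Pam'])  #8.0 6.0
```

The resulting Series is indexed by speaker name and can be sorted and plotted directly.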
#creating a dataframe
mean_ngram_entropies = pd.DataFrame({'speaker': ['Show', 'Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'],
'Mean n-gram entropy': [mean_ngram_entropy_show, mean_ngram_entropy_michael,mean_ngram_entropy_dwight,mean_ngram_entropy_jim,mean_ngram_entropy_pam,mean_ngram_entropy_andy,mean_ngram_entropy_toby,mean_ngram_entropy_stanley,mean_ngram_entropy_kelly,mean_ngram_entropy_ryan,mean_ngram_entropy_phyllis,mean_ngram_entropy_oscar,mean_ngram_entropy_darryl,mean_ngram_entropy_jan,mean_ngram_entropy_creed,mean_ngram_entropy_meredith,mean_ngram_entropy_angela,mean_ngram_entropy_kevin,mean_ngram_entropy_erin]})
mean_ngram_entropies = mean_ngram_entropies.sort_values(by='Mean n-gram entropy', ascending=False)
fig = px.bar(mean_ngram_entropies,
title = 'Mean n-gram Entropy by Speaker',
x='speaker',
y='Mean n-gram entropy',
color='Mean n-gram entropy',
text_auto=True)
fig.update_layout(yaxis_title='Mean n-gram Entropy', xaxis_title='Speaker')
fig.add_shape(
name="show",
showlegend=False,
type="rect",
line=dict(dash="dash"),
x0=-0.4,
x1=0.4,
y0=0,
y1=17.13,
)
fig.show()
The "Show" value, representing the combined entropy across all speakers, is the highest at 17.13, once again showing that the combined entropy exceeds that of any individual. Among individual characters, Michael ranks highest at 16.03, followed by Dwight (15.68), Jim (15.23), and Andy (15.16), suggesting their dialogue exhibits the greatest diversity and unpredictability. Pam follows closely at 15.04, with most other characters, including Angela, Erin, and (surprisingly) Kevin, clustering around 13.9. Creed has the lowest mean entropy at 12.79, which would suggest his lines are the most predictable - yet that is not the case. More likely, it stems from the fact that his character's dialogues and scenes are sparse.
Overall, this distribution does highlight significant differences in the complexity of dialogue assigned to characters, aligning with their narrative roles and personality traits. Michael’s high entropy reflects his dynamic and erratic personality, while Creed’s low entropy suggests his minimalistic and idiosyncratic contributions.
#saving the image
#fig.write_image('graphs/entropy_n_grams_mean_by_speaker.png', engine="kaleido")
#adding bootstrap confidence intervals for all speakers
fig = go.Figure()
for speaker in mean_ngram_entropies['speaker']:
    data = entropies[speaker]
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'{speaker}'))
fig.update_layout(title='Bootstrap Confidence Interval for n-gram Entropy by Speaker')
fig.update_yaxes(title='Mean n-gram Entropy')
fig.show()
The "Show" as a whole exhibits the highest mean entropy, reflecting the diversity and complexity of dialogue across all characters combined. Among individual characters, Michael displays the highest mean entropy with a relatively wide confidence interval, aligning with his unpredictable and often erratic speech patterns. Dwight follows closely, with a similarly elevated mean entropy, likely reflecting the unique and verbose nature of his lines.
Jim and Pam occupy the middle range, with moderately high entropy and narrower confidence intervals, indicating consistency in their roles as central characters with balanced humor and dialogue complexity. Characters like Angela, Erin, and Kevin fall towards the lower end of the spectrum, with tighter confidence intervals, reflecting simpler and more predictable speech patterns that align with their well-defined and somewhat static personalities.
Secondary characters such as Creed and Meredith exhibit the lowest mean entropies, possibly due to their sporadic dialogue that tends to be quirky but concise.
The Word Entropy
Once again, time to define a function!
def word_entropy(text):
    words = text.split()
    counter = Counter(words)
    total = sum(counter.values())
    return -sum(count/total * log2(count/total) for count in counter.values())
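A small worked example (the function is restated, with its imports, so the snippet runs on its own): for the text 'the cat the', the word counts are {'the': 2, 'cat': 1}, giving H = -(2/3)*log2(2/3) - (1/3)*log2(1/3), which is about 0.918 bits.

```python
from collections import Counter
from math import log2

def word_entropy(text):
    words = text.split()
    counter = Counter(words)
    total = sum(counter.values())
    #Shannon entropy over the word frequency distribution, in bits
    return -sum(count/total * log2(count/total) for count in counter.values())

print(round(word_entropy('the cat the'), 3))  #0.918
```

Note that this treats each line as a bag of words, so word order does not affect the result.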
Now, time to apply it to the entire dataframe.
office['word_entropy'] = office['line_formatted'].apply(word_entropy)
office.head()
#success
| | season | episode | title | scene | speaker | line | line_formatted | entropy | entropy_formatted | word_entropy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 1 | Michael | All right Jim. Your quarterlies look very good... | all right jim your quarterlies look very good ... | 4.239712 | 3.999839 | 3.807355 |
| 1 | 1 | 1 | Pilot | 1 | Jim | Oh, I told you. I couldn't close it. So... | oh i told you i couldnt close it so | 3.851149 | 3.364299 | 2.947703 |
| 2 | 1 | 1 | Pilot | 1 | Michael | So you've come to the master for guidance? Is ... | so youve come to the master for guidance is th... | 4.245317 | 3.975739 | 3.807355 |
| 3 | 1 | 1 | Pilot | 1 | Jim | Actually, you called me in here, but yeah. | actually you called me in here but yeah | 3.927418 | 3.675892 | 3.000000 |
| 4 | 1 | 1 | Pilot | 1 | Michael | All right. Well, let me show you how it's done. | all right well let me show you how its done | 3.987594 | 3.695948 | 3.321928 |
Creating dataframes for the characters. This step could be skipped, as it is a 1:1 copy of the dataframe creation done in the "character entropy" part. However, since this report shows the step-by-step process of the exploratory analysis, the dataframes are created again after applying the word_entropy function.
#re-creating the dataframes after adding the word_entropy column
michael = office[office['speaker'] == 'Michael']
dwight = office[office['speaker'] == 'Dwight']
jim = office[office['speaker'] == 'Jim']
pam = office[office['speaker'] == 'Pam']
andy = office[office['speaker'] == 'Andy']
toby = office[office['speaker'] == 'Toby']
stanley = office[office['speaker'] == 'Stanley']
kelly = office[office['speaker'] == 'Kelly']
ryan = office[office['speaker'] == 'Ryan']
phyllis = office[office['speaker'] == 'Phyllis']
oscar = office[office['speaker'] == 'Oscar']
darryl = office[office['speaker'] == 'Darryl']
jan = office[office['speaker'] == 'Jan']
creed = office[office['speaker'] == 'Creed']
meredith = office[office['speaker'] == 'Meredith']
angela = office[office['speaker'] == 'Angela']
kevin = office[office['speaker'] == 'Kevin']
erin = office[office['speaker'] == 'Erin']
#calculating mean word entropy for each character
michael_word_entropy = michael['word_entropy'].mean()
dwight_word_entropy = dwight['word_entropy'].mean()
jim_word_entropy = jim['word_entropy'].mean()
pam_word_entropy = pam['word_entropy'].mean()
andy_word_entropy = andy['word_entropy'].mean()
toby_word_entropy = toby['word_entropy'].mean()
stanley_word_entropy = stanley['word_entropy'].mean()
kelly_word_entropy = kelly['word_entropy'].mean()
ryan_word_entropy = ryan['word_entropy'].mean()
phyllis_word_entropy = phyllis['word_entropy'].mean()
oscar_word_entropy = oscar['word_entropy'].mean()
darryl_word_entropy = darryl['word_entropy'].mean()
jan_word_entropy = jan['word_entropy'].mean()
creed_word_entropy = creed['word_entropy'].mean()
meredith_word_entropy = meredith['word_entropy'].mean()
angela_word_entropy = angela['word_entropy'].mean()
kevin_word_entropy = kevin['word_entropy'].mean()
erin_word_entropy = erin['word_entropy'].mean()
#calculating word entropy for the entire show
office_word_entropy = office['word_entropy'].mean()
#print(office_word_entropy)
Calculating mean word entropy by speaker.
word_entropies = pd.DataFrame({'speaker': ['Show','Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'],
'Word Entropy': [office_word_entropy,michael_word_entropy,dwight_word_entropy,jim_word_entropy,pam_word_entropy,andy_word_entropy,toby_word_entropy,stanley_word_entropy,kelly_word_entropy,ryan_word_entropy,phyllis_word_entropy,oscar_word_entropy,darryl_word_entropy,jan_word_entropy,creed_word_entropy,meredith_word_entropy,angela_word_entropy,kevin_word_entropy,erin_word_entropy]})
#sorting the speakers by word entropy
word_entropies = word_entropies.sort_values(by='Word Entropy', ascending=False)
fig = px.bar(word_entropies,
x='speaker',
y='Word Entropy',
title='Mean Word Entropy by Speaker',
color = 'Word Entropy',
text_auto = True,
)
fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Speaker')
fig.add_shape(
name="show",
showlegend=False,
type="rect",
line=dict(dash="dash"),
x0=5.4,
x1=4.6,
y0=0,
y1=2.5,
)
fig.show()
#saving the image#
#fig.write_image('graphs/entropy_word_by_speaker.png', engine="kaleido")
#adding bootstrap confidence intervals for the speakers
speakers = ['Michael', 'Dwight', 'Jim', 'Pam', 'Andy', 'Toby', 'Stanley', 'Kelly', 'Ryan', 'Phyllis', 'Oscar', 'Darryl', 'Jan', 'Creed', 'Meredith', 'Angela', 'Kevin', 'Erin']
fig = go.Figure()
for speaker in speakers:
    data = office[office['speaker'] == speaker]['word_entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'{speaker}'))
fig.update_layout(title='Bootstrap Confidence Interval for Word Entropy by Speaker')
fig.update_yaxes(title='Mean Word Entropy')
fig.show()
Michael has the highest mean word entropy, which aligns with his unpredictable and verbose speech patterns, often filled with tangents and unique phrasing. Following him, Andy exhibits relatively high entropy, reflecting his tendency to incorporate varied vocabulary and musical quirks into his dialogue.
At the lower end, characters like Kevin and Angela have the smallest mean word entropy, emphasizing their consistent and simple speech patterns. Kevin's lines often rely on humor derived from straightforwardness, while Angela's dialogue remains formal and repetitive, consistent with her rigid personality. Creed, despite his sporadic dialogue, shows a surprisingly wide confidence interval, reflecting the variability of his cryptic and often surreal contributions.
The confidence intervals for Pam and Jim, though modestly lower than characters like Michael or Andy, highlight the stability of their conversational styles, reflecting their central and relatable roles in the series.
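The `word_entropy` column used throughout is computed earlier in the notebook, outside this excerpt. For readers joining here, the following is a minimal stdlib-only sketch of per-line Shannon word entropy; the function name and the lowercase/whitespace tokenization are illustrative assumptions, and the notebook's actual preprocessing may differ.

```python
from collections import Counter
from math import log2

def word_entropy(line: str) -> float:
    """Shannon entropy (in bits) of the word distribution within one line."""
    words = line.lower().split()  # assumed tokenization: lowercase, split on whitespace
    if not words:
        return 0.0
    counts = Counter(words)
    n = len(words)
    # H = sum over words of p * log2(1/p), where p is the word's relative frequency
    return sum((c / n) * log2(n / c) for c in counts.values())

# A line that repeats one word has zero entropy...
print(word_entropy("beets beets beets beets"))           # 0.0
# ...while four distinct words reach the 2-bit maximum for four tokens
print(word_entropy("bears beets battlestar galactica"))  # 2.0
```

Repetitive lines score low and lexically varied lines score high, which is exactly the contrast the per-speaker chart above visualizes.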
#calculating mean word entropy by season
word_entropy_season_one = office[office['season'] == 1]['word_entropy'].mean()
word_entropy_season_two = office[office['season'] == 2]['word_entropy'].mean()
word_entropy_season_three = office[office['season'] == 3]['word_entropy'].mean()
word_entropy_season_four = office[office['season'] == 4]['word_entropy'].mean()
word_entropy_season_five = office[office['season'] == 5]['word_entropy'].mean()
word_entropy_season_six = office[office['season'] == 6]['word_entropy'].mean()
word_entropy_season_seven = office[office['season'] == 7]['word_entropy'].mean()
word_entropy_season_eight = office[office['season'] == 8]['word_entropy'].mean()
word_entropy_season_nine = office[office['season'] == 9]['word_entropy'].mean()
#creating a dataframe for the word entropy by season
seasons_word_entropy = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
'Word Entropy': [word_entropy_season_one, word_entropy_season_two, word_entropy_season_three, word_entropy_season_four, word_entropy_season_five, word_entropy_season_six, word_entropy_season_seven, word_entropy_season_eight, word_entropy_season_nine]})
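As an aside, the nine hand-written per-season means above can be collapsed into a single `groupby`. Here is a sketch on a toy stand-in frame; the real `office` frame has the same `season` and `word_entropy` columns, so the same two chained calls would apply to it directly.

```python
import pandas as pd

# Toy stand-in for the notebook's `office` frame (column names taken from context)
office = pd.DataFrame({
    'season': [1, 1, 2, 2, 2],
    'word_entropy': [2.0, 3.0, 2.5, 2.5, 4.0],
})

# One groupby-mean replaces the nine per-season assignments
seasons_word_entropy = (
    office.groupby('season', as_index=False)['word_entropy']
          .mean()
          .rename(columns={'season': 'Season', 'word_entropy': 'Word Entropy'})
)
print(seasons_word_entropy)
```

The result is already sorted by season and plugs straight into `px.bar` as before.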
#plotting the data
fig = px.bar(seasons_word_entropy,
x='Season',
y='Word Entropy',
title='Word Entropy by Season',
color='Word Entropy',
text_auto=True,)
fig.show()
#saving the image
#fig.write_image('graphs/entropy_word_by_season.png', engine="kaleido")
#adding bootstrap confidence intervals
seasons = office['season'].unique()
seasons.sort()
fig = go.Figure()
for season in seasons:
    data = office[office['season'] == season]['word_entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Word Entropy by Season')
fig.update_yaxes(title='Mean Word Entropy')
fig.show()
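For intuition about what `scipy.stats.bootstrap` is doing in these cells, a hand-rolled percentile bootstrap for the mean can be sketched with numpy alone. Note that scipy's default is the more refined BCa method, so its interval will differ slightly; the sample values below are made up for illustration.

```python
import numpy as np

def bootstrap_ci(sample, n_resamples=9999, confidence=0.9, seed=42):
    """Percentile-bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    # Draw n_resamples resamples with replacement and record each resample's mean
    idx = rng.integers(0, len(sample), size=(n_resamples, len(sample)))
    means = sample[idx].mean(axis=1)
    # Take the central `confidence` mass of the resampled means
    alpha = (1 - confidence) / 2
    return np.quantile(means, alpha), np.quantile(means, 1 - alpha)

low, high = bootstrap_ci(np.array([2.1, 2.8, 3.0, 2.4, 2.6, 2.9]))
print(f"90% CI for the mean: ({low:.2f}, {high:.2f})")
```

The width of such an interval shrinks with sample size, which is why speakers and seasons with fewer lines show visibly wider boxes in the charts above.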
In Season 1, the mean word entropy is relatively high, with a broader confidence interval, reflecting the exploratory nature of the characters and their speech patterns as the show establishes its tone and dynamics. As the series progresses into Seasons 2 through 5, there is a slight decrease in word entropy, which stabilizes over these middle seasons. This trend may indicate the characters settling into their more predictable and consistent personalities, as well as the show finding its rhythm in humor and dialogue style.
From Season 6 onwards, word entropy begins to increase slightly, with more pronounced variability in Seasons 8 and 9. This rising trend could correspond to the introduction of new narrative elements, evolving character arcs, and shifts in tone as the series approaches its conclusion. The broader intervals in these later seasons highlight a greater diversity in dialogue, reflecting the show's attempts to refresh its dynamics and explore new themes.
Calculating mean word entropy by season for the main characters.
michael_word_entropy_season_one = michael[michael['season'] == 1]['word_entropy'].mean()
michael_word_entropy_season_two = michael[michael['season'] == 2]['word_entropy'].mean()
michael_word_entropy_season_three = michael[michael['season'] == 3]['word_entropy'].mean()
michael_word_entropy_season_four = michael[michael['season'] == 4]['word_entropy'].mean()
michael_word_entropy_season_five = michael[michael['season'] == 5]['word_entropy'].mean()
michael_word_entropy_season_six = michael[michael['season'] == 6]['word_entropy'].mean()
michael_word_entropy_season_seven = michael[michael['season'] == 7]['word_entropy'].mean()
#creating a data frame
michael_word_entropies = pd.DataFrame({'season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7'],
'Word Entropy': [michael_word_entropy_season_one, michael_word_entropy_season_two, michael_word_entropy_season_three, michael_word_entropy_season_four, michael_word_entropy_season_five, michael_word_entropy_season_six, michael_word_entropy_season_seven]})
fig = px.bar(michael_word_entropies,
x='season',
y='Word Entropy',
title='Michael\'s Word Entropy by Season',
color = 'Word Entropy',
text_auto = True,
range_color=[2, 3.1]
)
fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Season')
fig.show()
#saving the image
#fig.write_image('graphs/entropy_michael_word_entropy.png', engine="kaleido")
#adding bootstrap confidence intervals for Michael
seasons = michael['season'].unique()
seasons = seasons[:-1] #dropping the final entry, likely Michael's brief Season 9 cameo, to match the bar chart above
fig = go.Figure()
for season in seasons:
    data = michael[michael['season'] == season]['word_entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Michael\'s Word Entropy by Season')
fig.update_yaxes(title='Mean Word Entropy')
fig.show()
In Season 1, Michael’s mean word entropy is noticeably higher, with a wide confidence interval reflecting the exploratory and erratic nature of his character as the central comedic force in the early stages of the series. His unpredictable and eccentric speech aligns with this elevated entropy.
As the show progresses into Seasons 2 through 4, Michael’s entropy stabilizes, showing a decline and narrower confidence intervals. This decrease may signify the development of his character into a more consistent archetype with recurring behaviors and dialogue patterns, including his infamous catchphrases and cringeworthy humor.
In later seasons, from Season 5 onwards, Michael's entropy rises again, peaking in Season 7. This peak in his final full season could reflect narrative closure and the varied tones of his farewell episodes. Overall, the trajectory captures the evolution of Michael's character and his changing role within the show's comedic and emotional framework.
dwight_word_entropy_season_one = dwight[dwight['season'] == 1]['word_entropy'].mean()
dwight_word_entropy_season_two = dwight[dwight['season'] == 2]['word_entropy'].mean()
dwight_word_entropy_season_three = dwight[dwight['season'] == 3]['word_entropy'].mean()
dwight_word_entropy_season_four = dwight[dwight['season'] == 4]['word_entropy'].mean()
dwight_word_entropy_season_five = dwight[dwight['season'] == 5]['word_entropy'].mean()
dwight_word_entropy_season_six = dwight[dwight['season'] == 6]['word_entropy'].mean()
dwight_word_entropy_season_seven = dwight[dwight['season'] == 7]['word_entropy'].mean()
dwight_word_entropy_season_eight = dwight[dwight['season'] == 8]['word_entropy'].mean()
dwight_word_entropy_season_nine = dwight[dwight['season'] == 9]['word_entropy'].mean()
#creating a dataframe
dwight_word_entropies = pd.DataFrame({'Season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'],
'Word Entropy': [dwight_word_entropy_season_one, dwight_word_entropy_season_two, dwight_word_entropy_season_three, dwight_word_entropy_season_four, dwight_word_entropy_season_five, dwight_word_entropy_season_six, dwight_word_entropy_season_seven, dwight_word_entropy_season_eight, dwight_word_entropy_season_nine]})
fig = px.bar(dwight_word_entropies,
x='Season',
y='Word Entropy',
title='Dwight\'s Word Entropy by Season',
color = 'Word Entropy',
text_auto = True,
range_color=[2, 3.1]
)
fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Season')
fig.show()
#saving the image
#fig.write_image('graphs/entropy_dwight_word_entropy.png', engine="kaleido")
#adding bootstrap confidence intervals for Dwight
seasons = dwight['season'].unique()
fig = go.Figure()
for season in seasons:
    data = dwight[dwight['season'] == season]['word_entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Dwight\'s Word Entropy by Season')
fig.update_yaxes(title='Mean Word Entropy')
fig.show()
In Season 1, Dwight's mean word entropy is relatively high with a very wide confidence interval, reflecting the early establishment of his eccentric, verbose, and often unpredictable personality. The wide confidence interval suggests variability as the character’s quirks and speech patterns are being defined.
As the series progresses into Seasons 2 through 6, Dwight's word entropy rises slightly and then stabilizes, which may correspond to the development of more consistent patterns in his dialogue, often centered on his authoritarian demeanor, workplace rivalries, and survivalist anecdotes. This period aligns with Dwight becoming a more predictable yet dynamic figure, rooted in his distinctive traits.
In Seasons 7 through 9, Dwight's word entropy increases significantly, with Season 9 showing the highest levels. This shift could reflect the expansion of Dwight’s role in the show’s narrative, including his romantic subplot with Angela and his eventual rise to managerial responsibilities.
jim_word_entropy_season_one = jim[jim['season'] == 1]['word_entropy'].mean()
jim_word_entropy_season_two = jim[jim['season'] == 2]['word_entropy'].mean()
jim_word_entropy_season_three = jim[jim['season'] == 3]['word_entropy'].mean()
jim_word_entropy_season_four = jim[jim['season'] == 4]['word_entropy'].mean()
jim_word_entropy_season_five = jim[jim['season'] == 5]['word_entropy'].mean()
jim_word_entropy_season_six = jim[jim['season'] == 6]['word_entropy'].mean()
jim_word_entropy_season_seven = jim[jim['season'] == 7]['word_entropy'].mean()
jim_word_entropy_season_eight = jim[jim['season'] == 8]['word_entropy'].mean()
jim_word_entropy_season_nine = jim[jim['season'] == 9]['word_entropy'].mean()
#creating a dataframe
jim_word_entropies = pd.DataFrame({'Season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'],
'Word Entropy': [jim_word_entropy_season_one, jim_word_entropy_season_two, jim_word_entropy_season_three, jim_word_entropy_season_four, jim_word_entropy_season_five, jim_word_entropy_season_six, jim_word_entropy_season_seven, jim_word_entropy_season_eight, jim_word_entropy_season_nine]})
fig = px.bar(jim_word_entropies,
x='Season',
y='Word Entropy',
title='Jim\'s Word Entropy by Season',
color = 'Word Entropy',
text_auto = True,
range_color=[2, 3.1]
)
fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Season')
fig.show()
#saving the image
#fig.write_image('graphs/entropy_jim_word_entropy.png', engine="kaleido")
#adding bootstrap confidence intervals for Jim
seasons = jim['season'].unique()
fig = go.Figure()
for season in seasons:
    data = jim[jim['season'] == season]['word_entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Jim\'s Word Entropy by Season')
fig.update_yaxes(title='Mean Word Entropy')
fig.show()
In Season 1, Jim's mean word entropy is among the highest, suggesting a degree of unpredictability in his dialogue as his character's sarcastic and observational humor is being established. The wide confidence interval reflects variability as the show explores his dynamic with other characters.
In Seasons 2 through 5, Jim's word entropy stabilizes at lower levels, corresponding to the maturation of his role as a steady counterpart to the chaos around him. His dialogue becomes more focused on recurring elements such as pranks on Dwight and his evolving relationship with Pam, contributing to a more consistent and predictable pattern.
From Season 6 onward, entropy begins to rise again, peaking in Season 9. This increase aligns with Jim's expanded narrative arc, including his entrepreneurial endeavors and personal challenges, which likely add complexity and variety to his speech patterns. The broader intervals in the later seasons reflect the variability in his character's dialogue as he navigates these new storylines.
pam_word_entropy_season_one = pam[pam['season'] == 1]['word_entropy'].mean()
pam_word_entropy_season_two = pam[pam['season'] == 2]['word_entropy'].mean()
pam_word_entropy_season_three = pam[pam['season'] == 3]['word_entropy'].mean()
pam_word_entropy_season_four = pam[pam['season'] == 4]['word_entropy'].mean()
pam_word_entropy_season_five = pam[pam['season'] == 5]['word_entropy'].mean()
pam_word_entropy_season_six = pam[pam['season'] == 6]['word_entropy'].mean()
pam_word_entropy_season_seven = pam[pam['season'] == 7]['word_entropy'].mean()
pam_word_entropy_season_eight = pam[pam['season'] == 8]['word_entropy'].mean()
pam_word_entropy_season_nine = pam[pam['season'] == 9]['word_entropy'].mean()
#creating a dataframe
pam_word_entropies = pd.DataFrame({'Season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'],
'Word Entropy': [pam_word_entropy_season_one, pam_word_entropy_season_two, pam_word_entropy_season_three, pam_word_entropy_season_four, pam_word_entropy_season_five, pam_word_entropy_season_six, pam_word_entropy_season_seven, pam_word_entropy_season_eight, pam_word_entropy_season_nine]})
fig = px.bar(pam_word_entropies,
x='Season',
y='Word Entropy',
title='Pam\'s Word Entropy by Season',
color = 'Word Entropy',
text_auto = True,
range_color=[2, 3.1]
)
fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Season')
fig.show()
#saving the image
#fig.write_image('graphs/entropy_pam_word_entropy.png', engine="kaleido")
#adding bootstrap confidence intervals for Pam
seasons = pam['season'].unique()
fig = go.Figure()
for season in seasons:
    data = pam[pam['season'] == season]['word_entropy']
    data = (data,)
    res = bootstrap(data, np.mean, confidence_level=0.9, random_state=42)
    fig.add_trace(go.Box(y=(res.confidence_interval.low, res.confidence_interval.high), name=f'Season {season}'))
fig.update_layout(title='Bootstrap Confidence Interval for Pam\'s Word Entropy by Season')
fig.update_yaxes(title='Mean Word Entropy')
fig.show()
In Season 1, Pam’s word entropy is relatively low, which aligns with her reserved and somewhat hesitant personality during the early episodes. The broad confidence interval reflects variability as her character adjusts to the narrative and interactions with other characters.
As the series progresses into Seasons 2 through 5, Pam’s word entropy begins to rise slightly but remains relatively stable. This stabilization corresponds to her growing confidence and more defined role in the show, particularly through her relationship with Jim and her evolving aspirations.
In Seasons 6 through 9, Pam’s word entropy shows a more pronounced increase, peaking in the final seasons. This trend is indicative of her expanding narrative arc, including her career challenges and family life, which add complexity to her dialogue.
Conclusions
Hypotheses reminder:
- Entropy measures of dialogue in The Office scripts correspond to the personality traits and behavioral complexity of individual characters.
- The Shannon entropy of a character’s lines is positively associated with how linguistically diverse and complex their personality appears on-screen. Characters with high dialogue entropy will appear more unpredictable or multifaceted, while characters with low dialogue entropy will appear more consistent or stereotypical.
- Characters with higher variance in entropy across episodes or seasons demonstrate more personality development (e.g., evolving speech patterns), whereas characters with stable entropy are portrayed as static or consistent.
- Entropy measures of The Office are indicative of the narrative pacing and thematic complexity of the show.
- Seasons with higher word entropy (measuring the diversity of word usage in a given unit of dialogue) reflect faster or more complex pacing, while lower word entropy reflects slower, more focused narrative flow.
- Seasons with higher combined entropy across all characters’ lines indicate a broader range of interactions, while episodes with lower combined entropy reflect more tightly focused plots or limited interactions.
The entropy analyses of The Office scripts align closely with these hypotheses, providing compelling evidence that dialogue complexity mirrors both the personality traits of individual characters and the broader narrative structure of the show. Characters like Michael and Dwight, who are known for their dynamic and unpredictable behaviors, display higher levels of dialogue entropy, which reflects their multifaceted personalities and the evolution of their roles within the series. Conversely, the steadier entropy patterns observed in characters like Angela or Kevin correspond to their more stable and consistent portrayals, reinforcing the connection between linguistic diversity and perceived character complexity. Characters with fluctuating entropy, such as Pam, further highlight how changes in dialogue can capture personal growth or shifts in narrative focus.
On a larger scale, the show's overall entropy trends reveal how its storytelling evolves across seasons. Earlier seasons, characterized by lower entropy, focus on tightly knit plots and the foundational establishment of character relationships. As the series progresses, increasing entropy reflects a broadening of the narrative scope, with more complex interactions, faster pacing, and an expanded thematic range. Later seasons, with their higher combined entropy, suggest a deliberate effort to explore diverse storylines and character arcs, providing a richer and more varied tapestry of dialogue and events. These findings reinforce the idea that entropy is a powerful tool for understanding both character depth and the show’s thematic and narrative complexity, lending quantitative support to the richness of The Office's storytelling.
I hope you enjoyed this analysis!